Introduces manifest.yaml that is the "last working world state" #405
The history looks a bit weird. Maybe a rebase is needed? 😀
Some nitpicks mostly around naming:
Re: Re: mirror_url Re:
ADD install-pax.sh /usr/local/bin
ADD install-flax.sh /usr/local/bin
ADD install-te.sh /usr/local/bin
# update TE manifest file to install the [test] extras
Why here but not in `Dockerfile.jax`? TE is part of the JAX image now, so I would assume someone wants to test TE in the JAX container.
I don't mind. Any reason you kept it separate @yhtang ?
So I took a look at it, and it appears we could merge the two without any issue since there's just a `sed` in `Dockerfile.pax.amd64` that replaces the line, so I think it'd be fine to do that in `requirements-jax.in`. Any objections? @DwarKapex @yhtang
Actually, another issue I found looking at this is that `Dockerfile.pax.arm64` doesn't include this `sed`. Is that intentional, @yhtang?
Yes. As I mentioned during the CI sync, IIRC the `[test]` extra could not be installed on ARM64.
We can address this as part of #338.
Try to rebase/merge to main again (or create another clean PR), because it is pretty difficult to get an idea what changes are yours vs what changes came from
SG. I've rebased (preview: #406), but it was a pretty gnarly rebase. Without reparenting, just squashing had conflicts. I need to test that I haven't broken anything, but once I've tested, I'll force push here.
force-pushed from db39f6d to a7dc646
force-pushed from 2ee2d6e to 723f91d
base_image from workflow_dispatch
RUN mkdir -p /opt/pip-tools.d
ADD --chmod=777 \
get-source.sh \
pip-finalize.sh \
/usr/local/bin/
RUN wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_$(dpkg --print-architecture) -O /usr/local/bin/yq && \
We don't have a guideline yet regarding third-party PPAs. Let's stick with this simple wget for the time being.
# build lingvo
RUN <<"EOT" bash -exu
set -o pipefail
RUN <<"EOF" bash -exu -o pipefail
I used `EOT` for consistency with Docker's official guide. This way, you don't have to rename the inner `EOF` to `EOFINNER`.
I mean we don't have to rename, but I feel that it's clearer since a question already came up about its significance. This way the identifier is also self-documenting.
echo "-e file://${SRC_PATH_PRAXIS}" >> /opt/pip-tools.d/manifest.pax
echo "tensorflow==2.13.0" >> /opt/pip-tools.d/requirements-paxml.in
echo "tensorflow_datasets==4.9.2" >> /opt/pip-tools.d/requirements-paxml.in
echo "chex==0.1.7" >> /opt/pip-tools.d/requirements-paxml.in
Regarding EOT vs EOF, EOT is what Docker's official documentation used for demoing heredoc RUN. But ultimately this choice is arbitrary and up to us to reach a consensus.
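To illustrate the delimiter question, here is a minimal shell sketch (not taken from the Dockerfiles in this PR) of an outer heredoc script that itself contains an inner heredoc. The variable `demo_file` and the two pinned lines are placeholders chosen for illustration. The only point is that distinct delimiters (`EOT` outside, `EOF` inside) keep the inner terminator from closing the outer block.

```shell
set -eu

# Hypothetical scratch file, used only for this illustration.
demo_file="$(mktemp)"
export demo_file

# Outer heredoc: the quoted "EOT" delimiter stops the calling shell from
# expanding anything, so the script is passed to bash verbatim.
bash -eu -o pipefail <<"EOT"
# Inner heredoc: it can keep the conventional EOF name because the outer
# block only ends at a line containing exactly EOT.
cat > "${demo_file}" <<EOF
tensorflow==2.13.0
chex==0.1.7
EOF
EOT

cat "${demo_file}"
```

If both levels used `EOF`, the first `EOF` line would end the outer block early, which is why the inner delimiter would otherwise need a name such as `EOFINNER`.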
@@ -105,6 +100,28 @@ jobs:
BASE_IMAGE: ${{ needs.metadata.outputs.BASE_IMAGE_ARM64 }}
secrets: inherit

publish-build-badge:
needs: [metadata, amd64, arm64]
uses: ./.github/workflows/_publish_badge.yaml
`_publish_badge.yaml` is deprecated. In the future, we will generate badge JSONs as part of the job sitrep artifact, and then collectively publish them using `_finalize.yaml`.
I agree we should move to sitrep, but given how large this PR is, I'd like to tackle that in a follow-up PR. This shouldn't change the status quo; in fact, it should fix things b/c this badge is referenced on the front-page readme and it was somehow removed in a PR, so it hasn't been getting updated.
Sounds good. I just wanted to raise the issue so that you are aware of the coming change.
publish-build-badge:
needs: [metadata, amd64, arm64]
uses: ./.github/workflows/_publish_badge.yaml
if: always()
Instead of `always()`, use `!cancelled()` to prevent the step from running even if the workflow is cancelled.
In this case, I think we want `always()`. If upstream workflows fail, then this workflow is canceled, but if the badge creation only happens if `!cancelled()`, then the badge does not get updated, and we see yesterday's badge.
uses: ./.github/workflows/_publish_badge.yaml
if: ( always() )
if: always()
`!cancelled()`?
responded on other thread
# TODO: ARM
publish-build-badge:
needs: [metadata, amd64, arm64]
uses: ./.github/workflows/_publish_badge.yaml
Same comment for `_publish_badge.yaml`.
responded on other thread
needs: [metadata, amd64, arm64, test-unit, test-t5x]
# TODO: ARM
publish-test-badge:
needs: [metadata, amd64, test-unit-amd64, test-t5x-amd64]
uses: ./.github/workflows/_publish_badge.yaml
Same comment for `_publish_badge.yaml`.
responded on other thread
I feel that the way that we pass the version bump targets around could be better streamlined. The current trial branch approach seems a little risky. Admittedly, this could be done much more easily if the presubmits and the nightlies can be converged first.
@yhtang I don't think there's any risk b/c the ability to merge the trial branch isn't included in this change-set. Nightlies will continue to be built via I am in favor of merging this in as it works and the converge PR can make use of the mechanism I am introducing here.
@@ -197,7 +197,6 @@ t5x/contrib/gpu/scripts_gpu/singlenode_ft_frompile.sh \

# Known Issues
* There is a known sporadic NCCL crash that happens when using the T5x container at node counts greater than or equal to 32 nodes. We will fix this in the next release. The issue is tracked [here](https://github.com/NVIDIA/JAX-Toolbox/issues/194).
* The T5x nightlies disable `NCCL_NVLS_ENABLE=0` ([doc](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable)). Future releases will re-enable this feature.
Note to self: this was mistakenly deleted during rebase (revert)
I'll move to merge this PR once the new tests are done after the rebasing. Follow-up work could include:
Merging as the remaining tests are all matrix MGMN tests (with partially confirmed success) and there are ~200 backlog jobs in the queue at this moment.
There was a YAML indentation issue when I last rebased #405, and it ended up causing the comment `#` to enter the image list when we publish PAX nightly images.
As of #405, the presubmit CI is based on package versions listed in the `manifest.yaml` that is committed to the repository. This has not been updated for ~1 month, so the presubmit CI is testing ~1 month old versions of the ecosystem. This PR updates it using the commit from https://github.com/NVIDIA/JAX-Toolbox/tree/znightly-2024-01-03-7395605285, generated by the nightly CI run. Because this bumps the JAX version by ~1 month, we have to include fixes for deprecations, in particular replacing `jax.random.KeyArray` with plain `jax.Array` (nvjax-svc-0/t5x@4d5ec2f). The deprecated name is used in older versions of the `chex` package, which are being selected by pip's dependency resolver despite newer versions being available. We avoid this by giving pip a helping hand and nudging it to use a newer `numpy` version, which allows it to select a newer `chex`. But it's easy to imagine similar issues in the future with other packages. Closes #448.
This introduces "the manifest" (`manifest.yaml`) that describes the complete state of the JAX stack to allow for reproducible nightly builds and reproducible presubmit CI. The manifest (and patches) are staged in a "trial branch" each night, and if the build succeeds, we can merge the trial branch into main (TODO: GH Issue tracking automating merge). The presubmit CI runs on the PR's git-ref, so unless the author has committed a custom patch/manifest, they will always be running from the "last working state".
Description of `manifest.yaml`
The manifest allows specifying/pinning libraries that serve different purposes:
fiddle @ git+https://github.com/google/fiddle
Here are example entries of each of these in the manifest.yaml
Git repos
Git repos with patches
^mirror/
`file://` URI will be a patch committed into Jax-Toolbox's VC to ensure reproducibility
VCS Constraint
These aren't cloned (in fact, `get-source.sh` will error if you try to clone). These are used in `pip-finalize.sh` to pin VCS dependencies like
clu @ git+https://github.com/google/CommonLoopUtils#egg=clu
# To
clu @ git+https://github.com/google/CommonLoopUtils@89c2face3474a7482358068d7a00d9bb6e4b31fe
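The VCS pinning described above can be sketched in shell as follows. This is only an illustration under assumptions, not the actual logic of `pip-finalize.sh`: the helper name `pin_vcs_requirement` is hypothetical, and it simply drops a trailing `#egg=` fragment and appends `@<commit>` to the requirement line.

```shell
set -eu

# Hypothetical helper (not the actual pip-finalize.sh logic): rewrite a
# "name @ git+URL[#egg=name]" requirement so it points at a fixed commit.
pin_vcs_requirement() {
  requirement="$1"
  commit="$2"
  # First expression strips a trailing "#egg=..." fragment, if present;
  # second expression appends "@<commit>" to the end of the line.
  printf '%s\n' "${requirement}" |
    sed -e 's/#egg=[^[:space:]]*$//' -e "s/\$/@${commit}/"
}

pinned="$(pin_vcs_requirement \
  'clu @ git+https://github.com/google/CommonLoopUtils#egg=clu' \
  '89c2face3474a7482358068d7a00d9bb6e4b31fe')"
echo "${pinned}"
```

With the `clu` entry from the description, this prints the pinned form shown above. A real implementation would presumably read the commit from the manifest rather than take it as an argument.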
Pip constraint
Changes to the CI
Nightly JAX Build bumps the `manifest.yaml` and patches and commits them to a trial branch. If the tests pass, this trial branch should be merged to `main`
`bump.sh` that bumps the world state given the manifest
`REPO_*` and `REF_*` build args, since they are all specified in the `manifest.yaml`
`get-source.sh` and `create-distribution.sh` now take the manifest
Not addressed in this PR
~/.gitconfig
`mode: pip-constraint` since there was no dep that needed a constraint